Use a private task-local RNG for kernel launch seeds by JohnCobbler · Pull Request #3161 · JuliaGPU/CUDA.jl

JohnCobbler · 2026-06-04T13:53:02Z

Problem

make_seed(::HostKernel) draws from Julia's default RNG on every kernel launch, so launching a kernel advances the user-visible rand() stream:

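using CUDA, Random
kernel() = nothing  # trivial kernel; any launch triggers the issue
Random.seed!(42); a = rand()
Random.seed!(42); @cuda kernel(); b = rand()
a != b  # surprising: the launch consumed a draw from the default RNG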

Fix

Following the direction suggested in the issue ("probably better to maintain a local RNG in CUDA.jl for launching kernels"), make_seed now draws from a private task-local Xoshiro that is lazily created in task_local_storage():

launch_rng() = get!(Random.Xoshiro, task_local_storage(),
                    :CUDACore_launch_rng)::Random.Xoshiro

make_seed(::HostKernel) = rand(launch_rng(), UInt32)

Xoshiro() seeds itself from system entropy without touching the default RNG, so the launch RNG is fully decoupled from the default RNG in both directions:

- launches no longer perturb the user's rand() stream, and
- Random.seed!(...) no longer influences which seeds reach kernels (device-side reproducibility remains available via in-kernel Random.seed!, which is unchanged).
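
For readers without a GPU at hand, the decoupling can be illustrated with a CPU-only sketch. This is not the code in the PR: `sketch_make_seed` is a hypothetical stand-in for `make_seed(::HostKernel)`, and the storage key `:sketch_launch_rng` is made up for the example.

```julia
using Random

# Hypothetical storage key for this sketch; the PR uses :CUDACore_launch_rng.
launch_rng() = get!(Random.Xoshiro, task_local_storage(),
                    :sketch_launch_rng)::Random.Xoshiro

# Stand-in for make_seed(::HostKernel); no GPU is involved here.
sketch_make_seed() = rand(launch_rng(), UInt32)

# Direction 1: drawing a launch seed leaves the default stream untouched.
Random.seed!(42); a = rand()
Random.seed!(42); sketch_make_seed(); b = rand()
a == b  # the private RNG absorbed the draw, so both values match

# Direction 2: reseeding the default RNG does not replay launch seeds.
# s1 and s2 come from the entropy-seeded private RNG, so they are
# unrelated to the seed 42 and almost certainly differ.
Random.seed!(42); s1 = sketch_make_seed()
Random.seed!(42); s2 = sketch_make_seed()
```

The second half is the behavior change to be aware of: anyone who relied on Random.seed! on the host to fix device-side streams would need to seed inside the kernel instead.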

Task-local storage was chosen over a module-global RNG to stay thread-safe without locking on the launch path, mirroring how Julia's own default RNG is per-task. The get!(factory, dict, key) form matches the existing devices() helpers in lib/cublas / lib/cusolver, and the ::Random.Xoshiro assertion keeps the launch path type-stable (0 allocations on the fast path). Device-side seeding (make_seed(::DeviceKernel) and device/random.jl's Philox2x32) is untouched.
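
The per-task behavior can be checked in isolation. The sketch below assumes the same lazy-creation pattern as above, again with a made-up key (`:sketch_launch_rng`); it only shows that one task reuses its RNG object while another task gets its own, which is what makes locking unnecessary.

```julia
using Random

# Same pattern as in the PR, with a hypothetical key for this sketch.
launch_rng() = get!(Random.Xoshiro, task_local_storage(),
                    :sketch_launch_rng)::Random.Xoshiro

# Within one task the RNG is created once and then reused.
rng_here = launch_rng()
reused_in_task = launch_rng() === rng_here

# A different task has its own task-local storage, hence its own RNG.
rng_other = fetch(Threads.@spawn launch_rng())
separate_across_tasks = rng_other !== rng_here
```

Because no RNG object is shared between tasks, concurrent launches from different tasks never mutate the same state.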

An alternative considered: a global atomic counter (Philox keys only need to be distinct, not random, so sequential keys would give independent streams). That would make seeds deterministic across runs, which is a bigger semantic change — happy to switch if that behavior is preferred.
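
For comparison, the counter alternative might look roughly like the following. This is only a sketch of the option that was not chosen; `LAUNCH_COUNTER` and `counter_seed` are hypothetical names, and nothing like this is part of the PR.

```julia
# Hypothetical global counter; each launch takes the next value as its seed.
const LAUNCH_COUNTER = Threads.Atomic{UInt32}(0)

# atomic_add! returns the value held before the increment, so concurrent
# callers each receive a distinct seed without any locking.
counter_seed() = Threads.atomic_add!(LAUNCH_COUNTER, UInt32(1))

s1 = counter_seed()
s2 = counter_seed()
```

With this scheme the n-th launch of a process always gets the same seed, which is the cross-run determinism mentioned above.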

Testing

Added a regression test in test/core/execution.jl asserting the host RNG stream is identical with and without an interleaved launch, and that consecutive launches still receive distinct seeds. Verified on a Quadro RTX 6000 (sm_75, Julia 1.11.9): new testset passes (6/6) and the existing device-side RNG tests are unaffected (11/11).
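
The shape of that test, reduced to a CPU-only sketch, is shown below. It is not the testset added in test/core/execution.jl: `sketch_launch` is a hypothetical stand-in for an actual `@cuda` launch, and the key `:sketch_launch_rng` is made up.

```julia
using Random, Test

launch_rng() = get!(Random.Xoshiro, task_local_storage(),
                    :sketch_launch_rng)::Random.Xoshiro

# Stand-in for a kernel launch: only the seed draw is modeled.
sketch_launch() = rand(launch_rng(), UInt32)

@testset "launch seeds do not perturb the host RNG (sketch)" begin
    # Reference stream without any launches.
    Random.seed!(1234)
    expected = [rand() for _ in 1:4]

    # Same stream with a launch interleaved before every draw.
    Random.seed!(1234)
    seeds = UInt32[]
    observed = map(1:4) do _
        push!(seeds, sketch_launch())
        rand()
    end

    @test observed == expected
    # Distinctness is probabilistic: a collision among four random
    # UInt32 values is possible in principle but extremely unlikely.
    @test allunique(seeds)
end
```

Scalar rand() calls are used on purpose so that the comparison does not depend on how array-filling draws consume the stream.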

Possible follow-up (out of scope here)

If deterministic launch seeds are ever wanted (e.g. for debugging), a CUDA.seed_launch!(seed) that reseeds the task-local launch RNG would slot in cleanly on top of this — happy to do that as a separate PR if there's interest.

maleadt · 2026-06-05T07:38:50Z

> If deterministic launch seeds are ever wanted (e.g. for debugging), a CUDA.seed_launch!(seed) that reseeds the task-local launch RNG would slot in cleanly on top of this — happy to do that as a separate PR if there's interest.

Not needed. The user can always put a seed! call in the kernel.

maleadt · 2026-06-05T07:40:26Z

+# Task-local RNG used solely to seed kernel launches.  Drawing launch seeds
+# from a private RNG (rather than the task-global default) ensures that a kernel
+# launch never perturbs the user-visible `rand()` stream.  Lazily created per
+# task; `Xoshiro()` seeds itself from system entropy without touching the
+# default RNG.  (JuliaGPU/CUDA.jl#2417)


Please de-LLM some of these comments: no need to refer to the previous state, #2417 was a TODO so not worth pointing to, etc.

codecov · 2026-06-05T09:53:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.32%. Comparing base (aa47d7a) to head (88072e4).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3161      +/-   ##
==========================================
- Coverage   16.33%   16.32%   -0.02%     
==========================================
  Files         124      124              
  Lines        9875     9875              
==========================================
- Hits         1613     1612       -1     
- Misses       8262     8263       +1

github-actions

CUDA.jl Benchmarks

Benchmark suite	Current: `88072e4`	Previous: `aa47d7a`	Ratio
`array/accumulate/Float32/1d`	`99401` ns	`98976` ns	`1.00`
`array/accumulate/Float32/dims=1`	`75801` ns	`75470` ns	`1.00`
`array/accumulate/Float32/dims=1L`	`1596618` ns	`1595788` ns	`1.00`
`array/accumulate/Float32/dims=2`	`140800` ns	`140419` ns	`1.00`
`array/accumulate/Float32/dims=2L`	`653847` ns	`653444` ns	`1.00`
`array/accumulate/Int64/1d`	`118385` ns	`118155` ns	`1.00`
`array/accumulate/Int64/dims=1`	`79035` ns	`78907` ns	`1.00`
`array/accumulate/Int64/dims=1L`	`1708144` ns	`1709506` ns	`1.00`
`array/accumulate/Int64/dims=2`	`153801` ns	`153939` ns	`1.00`
`array/accumulate/Int64/dims=2L`	`959403` ns	`959330` ns	`1.00`
`array/broadcast`	`18271` ns	`18270` ns	`1.00`
`array/construct`	`1222.3` ns	`1198.4` ns	`1.02`
`array/copy`	`16408` ns	`16676` ns	`0.98`
`array/copyto!/cpu_to_gpu`	`212320` ns	`211135` ns	`1.01`
`array/copyto!/gpu_to_cpu`	`279697` ns	`278832` ns	`1.00`
`array/copyto!/gpu_to_gpu`	`10291` ns	`10531` ns	`0.98`
`array/iteration/findall/bool`	`133390` ns	`131993` ns	`1.01`
`array/iteration/findall/int`	`147460` ns	`146745` ns	`1.00`
`array/iteration/findfirst/bool`	`111639` ns	`111631` ns	`1.00`
`array/iteration/findfirst/int`	`112010` ns	`111858` ns	`1.00`
`array/iteration/findmin/1d`	`65743` ns	`66902` ns	`0.98`
`array/iteration/findmin/2d`	`100431` ns	`100550` ns	`1.00`
`array/iteration/logical`	`191970` ns	`189124` ns	`1.02`
`array/iteration/scalar`	`64769` ns	`66015` ns	`0.98`
`array/permutedims/2d`	`49410` ns	`49598` ns	`1.00`
`array/permutedims/3d`	`50855` ns	`50240` ns	`1.01`
`array/permutedims/4d`	`50619` ns	`50411` ns	`1.00`
`array/random/rand/Float32`	`11524` ns	`11982` ns	`0.96`
`array/random/rand/Int64`	`24234` ns	`23515` ns	`1.03`
`array/random/rand!/Float32`	`7935` ns	`8122` ns	`0.98`
`array/random/rand!/Int64`	`20840` ns	`20501` ns	`1.02`
`array/random/randn/Float32`	`34692` ns	`34458` ns	`1.01`
`array/random/randn!/Float32`	`23818` ns	`24130` ns	`0.99`
`array/reductions/mapreduce/Float32/1d`	`33469` ns	`33763` ns	`0.99`
`array/reductions/mapreduce/Float32/dims=1`	`38081` ns	`38228` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=1L`	`49795` ns	`50249` ns	`0.99`
`array/reductions/mapreduce/Float32/dims=2`	`55488` ns	`55439` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2L`	`67122` ns	`67163` ns	`1.00`
`array/reductions/mapreduce/Int64/1d`	`39352` ns	`40187` ns	`0.98`
`array/reductions/mapreduce/Int64/dims=1`	`41051` ns	`40738` ns	`1.01`
`array/reductions/mapreduce/Int64/dims=1L`	`86289` ns	`86458` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2`	`57638` ns	`57724` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2L`	`82659` ns	`82751` ns	`1.00`
`array/reductions/reduce/Float32/1d`	`33347` ns	`33392` ns	`1.00`
`array/reductions/reduce/Float32/dims=1`	`38115` ns	`38091` ns	`1.00`
`array/reductions/reduce/Float32/dims=1L`	`49855` ns	`49963` ns	`1.00`
`array/reductions/reduce/Float32/dims=2`	`55304` ns	`55491` ns	`1.00`
`array/reductions/reduce/Float32/dims=2L`	`68959` ns	`68898` ns	`1.00`
`array/reductions/reduce/Int64/1d`	`38812` ns	`40022` ns	`0.97`
`array/reductions/reduce/Int64/dims=1`	`40602` ns	`40672` ns	`1.00`
`array/reductions/reduce/Int64/dims=1L`	`86182` ns	`86301` ns	`1.00`
`array/reductions/reduce/Int64/dims=2`	`57354` ns	`57311` ns	`1.00`
`array/reductions/reduce/Int64/dims=2L`	`82719` ns	`82411` ns	`1.00`
`array/reverse/1d`	`16735` ns	`16807` ns	`1.00`
`array/reverse/1dL`	`67674` ns	`67676` ns	`1.00`
`array/reverse/1dL_inplace`	`65157` ns	`65187` ns	`1.00`
`array/reverse/1d_inplace`	`8217` ns	`9321.666666666666` ns	`0.88`
`array/reverse/2d`	`19752` ns	`19959` ns	`0.99`
`array/reverse/2dL`	`71559` ns	`71879` ns	`1.00`
`array/reverse/2dL_inplace`	`64905` ns	`65104` ns	`1.00`
`array/reverse/2d_inplace`	`9515` ns	`11067` ns	`0.86`
`array/sorting/1d`	`2650235` ns	`2655417` ns	`1.00`
`array/sorting/2d`	`1040002` ns	`1038734` ns	`1.00`
`array/sorting/by`	`3192735` ns	`3192232` ns	`1.00`
`cuda/synchronization/context/auto`	`1131.6` ns	`1131.5` ns	`1.00`
`cuda/synchronization/context/blocking`	`904.6666666666666` ns	`952.2173913043479` ns	`0.95`
`cuda/synchronization/context/nonblocking`	`5874` ns	`6097.8` ns	`0.96`
`cuda/synchronization/stream/auto`	`992.25` ns	`1004.5` ns	`0.99`
`cuda/synchronization/stream/blocking`	`811.2` ns	`825.6363636363636` ns	`0.98`
`cuda/synchronization/stream/nonblocking`	`5921` ns	`6045.333333333333` ns	`0.98`
`integration/byval/reference`	`143128` ns	`143141` ns	`1.00`
`integration/byval/slices=1`	`145262` ns	`145110` ns	`1.00`
`integration/byval/slices=2`	`283680` ns	`283495` ns	`1.00`
`integration/byval/slices=3`	`422143` ns	`422045` ns	`1.00`
`integration/cudadevrt`	`101447` ns	`101557` ns	`1.00`
`integration/volumerhs`	`9078485` ns	`9077766` ns	`1.00`
`kernel/indexing`	`12466` ns	`12534` ns	`0.99`
`kernel/indexing_checked`	`13466` ns	`13291` ns	`1.01`
`kernel/launch`	`2040.2222222222222` ns	`2072.5555555555557` ns	`0.98`
`kernel/occupancy`	`699.2517006802722` ns	`716.9097744360902` ns	`0.98`
`kernel/rand`	`13666` ns	`13723` ns	`1.00`
`latency/import`	`3845165278` ns	`3841987489` ns	`1.00`
`latency/precompile`	`4621357772` ns	`4621684240` ns	`1.00`
`latency/ttfp`	`4491670603` ns	`4482964065` ns	`1.00`

This comment was automatically generated by workflow using github-action-benchmark.

JohnCobbler force-pushed the launch-seed-rng branch from a7bf4f3 to c1144c5 Compare June 4, 2026 14:07

maleadt force-pushed the launch-seed-rng branch from c1144c5 to 7e9b958 Compare June 5, 2026 07:38

maleadt reviewed Jun 5, 2026

github-actions Bot reviewed Jun 5, 2026

JohnCobbler force-pushed the launch-seed-rng branch 2 times, most recently from 899cf3b to 88072e4 Compare June 5, 2026 13:26

fix: use a private task-local RNG for kernel launch seeds

88072e4

Kernel launches used to draw their seed from the default RNG, perturbing the user-visible rand() stream. Use a lazily-created task-local Xoshiro instead. Includes a regression test.

maleadt merged commit 112549e into JuliaGPU:main Jun 5, 2026
2 checks passed
